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Abstract 

A computational method is developed to work on an inverse equilib- 
rium problem with an interest towards applications with protein folding. 
In general, we are given a set of equilibrium configurations, and want to 
derive the most likely potential function that results in these configura- 
tions. The method is applied to polymer simulations and a simple model 
of proteins using protein structures obtained from the Protein Data Bank 
(http://www.rcsb.org/pdbl The resulting energy function is tested on a 
few decoy sets with limited success. 

Introduction 

The Protein Data Bank (http://www.rcsb.org/pdb) [3] has about 19,000 protein 
structures solved by several methods. There are many algorithms that use this 
information to derive understanding about protein interactions. Our method 
is based on using these equilibrium configurations and a maximum entropy 
algorithm to derive information about physical energy functions that could be 
used to approximate protein interactions. 

Our Method 

The following is based largely on an algorithm originally developed by reference 
15] ■ The method is based on the following assumptions. The system is assumed 
to be in thermodynamic equilibrium. For proteins, this was shown to be a good 
assumption by Anfinsen P] . We also assume the energy can be written as a sum 
of terms which are products of parameters and functions of the configuration. 
E(F,P) = J2iPi * ^i(r) = P ■ H where T represents the configuration of the 
system(s). P = {pi} represents the set of parameters to be derived. 

The probability of a configuration, given parameters, is given by the Boltz- 
mann distribution Prob(T\P) = e -E{r,P)/kT j z = e (-E{r,P)+F(P))/kT where 
Z (P) = J2r Ex P(-PE{T,P)) and F(P) = -kTln{Z{?)). 

If we are given the exact equilibrium conformation, T* , the maximum likeli- 
hood of parameter values are those values for which the probability, Prob(T*\P) 
is a maximum wrt P. Maximizing an exponential corresponds to maximizing the 
argument (ignoring the multiplicative constant 0), —E(T*,P) + F(P) — Q(P). 
This also corresponds to extremizing the entropy TS = E — F . 

Our method is basically the multi-dimensional form of Newton's method for 
optimizing functions. Maximizing Q(P), Newton's Method is 

pfe+i _ pk _ D 2(Q(pk}yi . D(Q(P k )) (1) 
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where (D 2 ) 1 represents the inverse Hessian matrix and D represents the gra- 
dient. In practice this is modified slightly, 

pk+i = pk + £ ( A p) ( 2 ) 

where the use of e < 1 corresponds to the " Damped Newton's Method" . 
Using statistical mechanical definitions 

2=fO = _„. 2^1=0 ,3, 

dpi opidpj 

^\ pk = -kT(l/Z)J2-Ph t e~^ r ^ =< h % > (4) 

% r 

d 2 F 

= /?(< hi >< hj>-< hihj >) (5) 

Maximizing Q = —E + F wrt P leads to the following. 

D{Q)i = -K+ <hi> (6) 

D 2 {Q)i,j = P{< hi X hj > - < hihj >) = -j3Cov(hi,hj) (7) 

Resulting in the following iterative equation where VCM(H) is the variancc- 
covariance matrix of H 

AP = kT *VCM{H)- 1 •(< H > -H*) (8) 

The method is easily generalized to a distribution of equilibrium configurations. 

AP = kT* VCM(H)- 1 •(<H>-<H > Prob{r )) (9) 

< ... > represents a Boltzmann average and < ... >p ro h(r) represents an average 
over the given distribution. 

Assuming the least prior information, the iteration starts with all parameters 
set to zero. This would not allow any useful MC evolution at all. The energy 
would be zero for any Monte Carlo move, thus not preferring any particular 
move. The energy is modified for the first few iterations with the addition of a 
clamping term. 

Edamp = ^ P clarnp * (r - r*) 2 (10) 

This term makes the given conformation a minimum of the energy Since this 
conformation is an equilibrium conformation, this seems to be a good approxi- 
mation. Once the parameter values arc sufficiently away from zero, this term is 
set to zero. If a distribution of conformations is given, this distribution can be 
used as the clamping terms. 
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Computation 

The basic algorithm is as follows. 

• Given P k , Monte Carlo simulations and averaging are used to find < hj > 
and < hihj >. 

• This leads to a matrix equation. AP = kT^VCMiH)- 1 »(<H > -H*) 

• Solve this for AP 

• P k+1 = P k + e(AP) 

• Repeat until P converges. 

Computation: Convergence time 

Assuming a system of M proteins, each containing N particles with P param- 
eters. The Monte Carlo time required for useful averaging scales at least as 
TV. The pair- wise energy calculation in each Monte Carlo step scales as iV 2 . 
Although for some models a radius cutoff and neighbor list were used, so this 
effect in practice is less than indicated. Solving a P by P matrix equation is a 
P 3 operation. Computation time for this method scales as 

aM * N 3 + f3P 3 (11) 

Simulation 

A Monte Carlo evolution was done on several systems. Four energy terms were 
used. 

± -, — - — , + 7 — h —r7K + c* \n -r i+1 \ + d* {n -n +1 f (12) 

\r l -r j \ (n-rj) 12 

Non-bonded interactions were included similar to an electrostatic attraction 
and van der Waals repulsion. The sequence of positive and negative charges was 
randomly chosen. A covalent-type bond between one particle and the next in 
the sequence was given a quadratic form. For the simulation described below, 
the following parameters were used a — 2, b = 4, c = —4, d = 1. The results of 
the algorithm applied to this system are shown in figure 1 and table 1. 



3 



"SPEED/n8_mc1 20/pvec.dat" 



^++++++++++++++++++++++++++++++ 



f ++++++++++++++++++++++++++++++++++++++++++++++++++++ 



f ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 



++++++ 



++++ +++++++++++++++++ + ++++ + ++++++++++++ ++ ++++++ 



++++++++++ 



10 



20 



30 



40 
Iteration 



50 



60 



70 



80 



Figure 1: A collection of 99, 8 particle proteins was constructed with a = 2, b = 
4, c = -4, d = 1, kt = 0.25 The algorithm used e = 0.5 and 120 MC moves per 
particle per iteration. Clamping was turned off at time=16 Using a computer 
with dual P3 450-667Mhz this took 1.5 hrs 



Correct Value 


Derived value 


2 


2.01 ± 0.01 


4 


4.36 ±0.03 


-4 


-3.88 ±0.03 


1 


0.97 ±0.01 



Table 1 : Parameters derived from equilibrium configurations compare well with 
correct values. 
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Protein Model 

Several major simplifications were made to allow convergence of real protein 
parameters in a reasonable time. The united residue approximation is used 
in the following model. Each of the residues is treated as one particle. This 
approximation is commonly used [21 El El This greatly simplifies the system, 
but should contain enough complexity to describe the system adequately. 

Proteins seem to have a rugged free energy surface. To minimize this effect 
we use relatively few MC time steps per iteration. This keeps the protein in the 
local minimum of free energy even if parameters are far from correct. 

Residues were placed at the C a location, covalently bond only with next and 
previous particles on the chain. The energy function used for the covalent bonds 
was a normal distribution using the mean and variance derived from the data. 
The noise in the covalent bond parameter seemed to cause large perturbations 
in the convergence for other parameters. These parameters need to be derived 
separately. 

The first model with consistent convergence used a statistical grouping of 
residues developed by Cieplak etal 0]. The groupings were derived using a 
simplification of a statistically derived interaction matrix, the Miyazawa Jerni- 
gan (MJ) matrix |||. This is a simple, consistent grouping which decreases the 
number of parameters. 

• Hydrophobic I (ave hydrophobicity scale value 2.6) (LF1) Leucine, fsoleucine 
and Phenylalanine 

• Hydrophobic H (avc HP scale value 1.8 with large variance) (MVWCY) 
Methionine, Valine, Tryptophan, Cysteine and Tyrosine 

• Polar I (ave HP scale value 1.15) (HA) Histidine, Alanine 

• Polar II (ave HP scale value 0.6) (TGPRQSNED) Threonine, Glycine, 
Proline, Arginine, Glutamine, Serine, Asparagine Glutamic acid and As- 
partic acid 

• Lysine (ave HP scale value 1.9) (K) Lysine 

Only one energy term was used corresponding to van der Waals attraction 

Ae[{o/rr [a/rf] (13) 

Essentially this is a contact energy function, a was determined by comparing 
typical volumes and treating the residues as spheres. This gives radii from 2.4 - 
3.8 Angstrom, a in the equation corresponds to where the core repulsion occurs 
(about 2*radius) so a value of 5 was arbitrarily assumed. 

In summary, the covalent properties were approximated from the mean and 
variance of bond lengths. The energy function has one term, a 6-12 combined 
term. A statistical grouping was used to further reduce parameters. This group- 
ing and energy function model has 15 parameters. Results shown below were 
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derived from proteins ranging in size from 20 to 400 residues. For consistency, 
only X-ray data and only complete proteins containing no extraneous molecules 
were used. The training set contained 821 proteins. All protein structures were 
obtained from the Protein Data Bank Only 20 MC steps per particle were 
used. 



Protein Models - Results 

Energy function: 4e[(er/r) 12 — (a/r) 6 ] with a = 5 



Group 


Hydrophobic I 


H II 


Polar I 


P II 


Lysine 


H I 


0.039 


0.033 


0.039 





0.015 


H II 




0.042 


0.042 





0.038 


P I 






0.033 





0.036 


P II 











0.018 


Lysine 










0.020 



Table 2: Attractive contact energy. Units are kT, u = 5 

The zeros were artificially created, as the algorithm can not handle these pa- 
rameters going negative, (e > 0) This would cause the MC to diverge. Despite 
this limitation, convergence was achieved. These results imply the least hy- 
drophobic (most polar) group essentially has no non-bonded interactions. This 
is very similar to the HP model of polymers where hydrophobic collapse is mod- 
eled as HH atraction and other interactions (HP and PP) are ignored. This 
took 14 days on a Dual PHI 450MHz. 
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Protein Models - Model Evaluation 



This energy function was applied to several decoy sets. Despite the simplicity of 
the energy function, results were mixed with several very encouraging successes. 
All decoy sets were obtained from http://dd.stanford.edu/ 

4 State Reduced Decoy Set [HI] 



Protein 


Rank 


Correct structure 


ave energy 


A/a 


present in data set? 


lctf 


1/631 


16.9 


19.3 


1.9 


n 


lr69 


1/676 


8.7 


9.9 


1.5 


y 


lsn3 


10/661 


16.3 


18.1 


1.3 


n 


2cro 


1/675 


14.4 


15.8 


1.3 


n 


3icb 


25/654 


14.6 


15.5 


0.9 


y 


4pti 


574/688 


12.9 


12.3 


-0.7 


y 


4rxn 


347/678 


43.2 


52.8 


0.5 


y 



Table 3: Ranking of correct structure energy using our energy function using 
the 4state reduced decoy set from Park and Levitt, 1996 



Local Minima Decoy Set (lmds) |7J 

Similar analysis was done for the lmds decoy set from C Kesar and M Levitt, 
1999. These results were not very encouraging. The correct proteins were 
the worst scores in almost all cases and by a large amount. Currently the main 
weakness is due to the difference in derivation for the covalent bonds. This decoy 
set was created using a minimization of a backbone torsional energy function, 
hence was very different from our function. Essentially, our successful decoy set 
predictions are based on only non-bonded interaction calculations. 

Fisa Casp3 Decoy Set |TT] 



Protein 


Rank 


Correct structure 


ave energy 


A/ a 


present in data set? 


lblO 


537/972 


17.6 


18.9 


0.2 


n 


leh2 


1/2414 


24.0 


26.2 


0.5 


n 


Ibg8-A 


1/1201 


16.3 


17.1 


0.3 


n 


ljwe 


1/1408 


18.7 


21.9 


0.7 


n 


smd3 


226/1201 


12.4 


13.3 


0.5 


n 



Table 4: Ranking of correct structure energy using our energy function using 
fisa casp3 decoy set from Simons KT,et al, 1997 [IT]. None were present in the 
data used in the derivation. 
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Protein Models - Discussion 

The energy function used is extremely simple. Despite this, ranking of the native 
protein for some of the decoys was very encouraging. The poor performers 
probably need more complex energy functions. Correlation between RMDS and 
energy was investigated, but no simple relationship was found. For the best 
performers, the native energy was typically isolated at the lowest energy with 
most decoys concentrated a distinct difference in energy away. 

The algorithm has potential, but several problems must be overcome. Com- 
plexity has not been handled very well and may be required for applicability. 
Inclusion of Coulomb, torsional, angular and backbone potential terms are re- 
quired for realistic models. All atom and explicit solvent are further steps that 
can be taken. Terms of smaller magnitude are dominated by effects of terms 
with larger magnitude and have to be separately derived. Similarly, terms with 
smaller frequency of occurrence are dominated by more frequent terms. This 
separate derivation of terms can be organized and iterated consistently. 
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